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ENTROPIC WASSERSTEIN GRADIENT FLOWS 

GABRIEL PEYRE* 


Abstract. This article details a novel numerical scheme to approximate gradient flows for opti¬ 
mal transport (i.e. Wasserstein) metrics. These flows have proved useful to tackle theoretically and 
numerically non-linear diffusion equations that model for instance porous media or crowd evolutions. 
These gradient ffows define a suitable notion of weak solutions for these evolutions and they can 
be approximated in a stable way using discrete ffows. These discrete ffows are implicit Euler time 
stepping according to the Wasserstein metric. A bottleneck of these approaches is the high com¬ 
putational load induced by the resolution of each step. Indeed, this corresponds to the resolution 
of a convex optimization problem involving a Wasserstein distance to the previous iterate. Follow¬ 
ing several recent works on the approximation of Wasserstein distances, we consider a discrete ffow 
induced by an entropic regularization of the transportation coupling. This entropic regularization 
allows one to trade the initial Wasserstein fldelity term for a Kullback-Leibler divergence, which is 
easier to deal with numerically. We show how Kullback-Leibler first order proximal schemes, and 
in particular Dykstra’s algorithm, can be used to compute each step of the regularized ffow. The 
resulting algorithm is both fast, parallelizable and versatile, because it only requires multiplications 
by the Gibbs kernel where c is the ground cost and 7 > 0 the regularization strength. On 

Euclidean domains discretized on an uniform grid, this corresponds to a linear Altering (for instance a 
Gaussian Altering when c is the squared Euclidean distance) which can be computed in nearly linear 
time. On more general domains, such as (possibly non-convex) shapes or on manifolds discretized by 
a triangular mesh, following a recently proposed numerical scheme for optimal transport, this Gibbs 
kernel multiplication is approximated by a short-time heat diffusion. We show numerical illustrations 
of this method to approximate crowd motions on complicated domains as well as non-linear diffusions 
with spatially-varying coefficients. 

Key words. Optimal transport, gradient ffow, JKO ffow, Wasserstein distance, Kullback-Leibler 
divergence, Dykstra’s algorithm, crowd motion, non-linear diffusion. 
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1. Introduction. 

1.1. Optimal Transport. 

Optimal transport: from theory to applications. In the last 20 years or so, optimal 
transport (OT) has emerged as a foundational tool to analyze diverse problems at the 
interface between variational analysis, partial differential equations and probability. 
We refer to the book of Villani [64] for an introduction to these topics. It took more 
time for this notion to become progressively mainstream in various applications, which 
is largely due to the high computation cost of the corresponding (static) linear program 
of Kantorovich [44] or to the dynamical formulation of Benamou and Brenier [10]. 
However, one can now found many relevant uses of OT in very diverse fields such as 
astrophysics [41], computer vision [54], computer graphics [16], image processing [68], 
statistics [13] and machine learning [31], to name a few. 

Entropic regularization. In order to obtain fast approximations of optimal trans¬ 
port distances (a.k.a. Wasserstein distances), there has been a recent revival of the 
so-called entropic regularization method. Cuturi [31] presented this scheme in the 
machine learning community as a fully parallelizable algorithm which can make the 
method scalable to large problems. He shows that this corresponds to the application 
of the well-known iterative diagonal scaling algorithm, which is sometime referred to 
as Sinkhorn’s algorithm [58, 60, 59] or IPFP [34]. This method is also closely related 
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to Schrodinger’s problem [56] of projecting a Gibbs distribution on fixed marginal 
constraints, see [55, 47] for recent mathematical accounts on this problem. 

The major interest of this entropic approximation is that it allows one to re-cast 
various OT-related problems as optimizations over the space of probabilities endowed 
with the Kulback-Leibler divergence. The geometry of this space, as well as the 
availability of efficient first-order optimization methods, makes this novel formulation 
numerically more friendly than the original linear program formulation. The price to 
pay for such simple and efficient approaches is the presence of an extra amount of 
smoothing (in fact a blurring by the Gibbs kernel) on the obtained results. 

Variational problems involving OT. These methods have been used to solve vari¬ 
ous variational optimization problem involving the Wasserstein distance. For instance 
the computation of Wasserstein barycenters, initially proposed in [3], has been ap¬ 
proximated by entropic regularization in [32]. A more general class of problems, 
including multi-marginal transport (see [53] for recent results on this topic) as well as 
generalized Euler’s flows (see [20] for a weak formulation of Euler’s equations), partial 
transport (as defined in [40]) and capacity-constrained transport (as defined in [42]) 
have been approximated by entropic smoothing in [11]. 

Our work goes in the same direction of applying entropic regularization to speed 
up the computation of OT-related problems. Instead of considering here the mini¬ 
mization of functionals involving the Wasserstein distance, we consider here the min¬ 
imization of convex functions according to the Wasserstein distance. 

1.2. Previous Works. 

Wasserstein flows - theory. It is natural to derive various partial differential 
equations (PDE’s) as gradient flows of certain energy functionals. While it is most of 
time assumed that the flow follows the gradient as defined through the topology on 
some Hilbert space of functions, it is sometime desirable to consider more complicated 
metrics. This allows one to capture different PDE’s and also sometime to give a precise 
meaning to weak solutions of these PDE’s. One of the most striking example is the 
computation of gradient flows over spaces of probability distributions (i.e. positive and 
normalized measures) according to the topology defined by the Wasserstein metric. 
In this setting, the gradient descent cannot be understood directly as an infinitesimal 
explicit descent in the direction of some gradient, but rather as a limit of an implicit 
Euler step, as detailed in Section 2.2. This idea corresponds to the notion of gradient 
flows in metric spaces exposed in the book [4] . 

The pioneer paper of Jordan, Kinderlehrer and Otto [43] shows how one recovers 
Eokker-Planck diffusions of distributions when one minimizes entropy functionals ac¬ 
cording to the W 2 Wasserstein metric. The corresponding method are often referred to 
as‘JKO flows” in reference to these authors. Since then, many non-linear PDE’s have 
been derived as gradient flows for Wasserstein metrics, including the porous medium 
equation [51], the heat equation on manifolds [38], degenerate parabolic PDE’s [1], 
Keller-Segel equation [14] and higher order PDE’s [23]. It is also possible to define a 
suitable notion of minimizing flow that cannot be written as PDE’s due to the non¬ 
differentiability of the energy functional, a striking example being the model of crowd 
motion with congestion proposed by [49] , 

Wasserstein flows - numeries. The use of Wasserstein methods to discretize non¬ 
linear evolutions is an emerging field of research. The major difficulty lies in the high 
computational cost induced by the resolution of each step. 

The case of 1-D densities is simpler because the optimal transport metric is a fiat 
metric when re-parameterized using inverse cumulative functions. This idea is used 
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in [45, 14, 15, 2, 48]. In higher dimension, a first class of approaches uses an Eulerian 
representation of the discretized density (i.e. on a fixed grid). The resulting problem 
can be solved using interior point methods for convex energies [23] or some sort of 
linearization in conjunction with finite elements [22] or finite volumes [24] schemes. 
A second class of approaches rather uses a Lagrangian representation, which is well 
adapted to optimal transport where the thought after solution is obtained by warping 
the density at the previous iterate. This idea is at the heart of several schemes, using 
discretized warpings [25], particules systems [67], moving meshes [21] and a consistent 
discretization of the gradients of convex functions (i.e. optimal transports) [12]. 

In this article, we use an Eulerian discretization and intend at approximating 
flows for energies that are already convex in the usual (Euclidean) sense. The main 
goal is to provide a fast and quite versatile discretization scheme through the use of 
an entropic smoothing method. 

First order scheme with respect to Bregman divergences. Eirst order proximal 
optimization schemes have been recently popularized in image processing and machine 
learning, due to their simplicity and the low computational cost of each iteration. Each 
step typically requires the computation of proximal operators, which are defined as 
strictly convex optimization sub-problems, corresponding to an implicit step according 
to the distance. We refer to the book [7] for an overview of this large class of 
methods and recent developments. Note that these proximal methods have been 
used to solve the dynamical formulation of OT [10, 52]. 

Many of these proximal algorithms have been extended when one replaces the 
metric by more general Bregman divergences. The prototypical algorithm (although 
rarely applicable in its original form) that has been extended to this divergence set¬ 
ting is the so-called proximal point algorithm [37] (see [46] for an extension to more 
general, possibly non-smooth, divergences) which corresponds to iteratively applying 
the proximal operator of the function to be minimized. 

Iterative projections on convex sets is probably the simplest yet useful example 
of proximal methods. It has been extended to the general setting of Bregman’s di¬ 
vergences by Bregman [18]. This scheme actually computes the projection on the 
intersection of convex sets if these sets are affine, which is a restrictive assumption. 
The natural extension of iterative projections to generic closed convex sets is the so- 
called Dykstra’s algorithm [36, 30], which can be interpreted as a block-coordinate 
optimization on the dual problem. Dykstra’s method has been extended to the special 
case of half-spaces in [26] and to generic closed convex sets in [9, 19]. Actually, as 
we show in Section 3.2, this result extends to arbitrary proper lower-semicontinuous 
convex functions (that are not necessarily indicators of closed convex sets). Note that 
such an extension is well-known for the case of the metric [6]. 

While in this paper we only make use of Dykstra’s algorithm, it should be noted 
that many more proximal splitting algorithms are available in this Bregman’s diver¬ 
gences setting, such as Douglas-Rachford and ADMM [65], primal-dual algorithms [27] 
and hybrid proximal point algorithms [61] 

1.3. Contributions. In this paper, we present a novel numerical scheme to 
compute approximations of discrete gradient flows for Wasserstein metrics. The ap¬ 
proximation is performed by an entropic smoothing of the original OT distance. Each 
step is computed as the resolution of a convex but possibly non-smooth optimization 
problem involving a Kulback-Leibler divergence to some Gibbs kernel. We thus pro¬ 
pose in Section 3 to solve it using an extension of Dykstra’s algorithm to this class of 
problems, for which we prove the convergence to the solution. Our main finding is that 
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this scheme is both simple to implement and competitive in term of computational 
speed, since it only requires multiplications with the Gibbs kernel, which, for many 
practical scenarios, can be achieved in nearly linear time. We illustrate in Secton 4 
this point by applications to a crowd motion model involving a non-smooth congestion 
term and to non-linear diffusions with spatially varying coefficients. Lastly, Section 5 
presents a generalization of the proposed algorithm to the case were several couplings 
are optimized. We show the usefulness of this generalization to compute the gradient 
flow of a Wasserstein attraction term with congestion and to compute evolution of 
several coupled densities. 

The code to reproduce the numerical part of this article is available online^ 

1.4. Notations. In the following we consider either vectors p G {N being 
the number of discretization points) that are usually in the probability simplex 

i:N = {p€R^;j:l,Pi = l} ( 1 . 1 ) 

and couplings, that are matrices tt G We denote (p, q) = Ylf=iPiQi 

canonical inner product on and similarly on 

For some set C C (typically Q = N or Q = Nx N), we define its indicator 
function as 


VaGM'^, ic{a)=- 


0 if a e C, 
-hoc otherwise. 


To ease notations, we define 0 and i as being entry-wise operations, i.e. aQb =^' {aibi)i 

and I {ai/bi)i. We denote as 1 (1,..., 1)^ G the vector filled with ones. 

We define 


V^gN, 



if i is odd, 
if i is even. 


( 1 . 2 ) 


We define minus the entropy on both vectors and couplings (and we make this 
distinction on purpose to ease the description of the proposed methods) as 

N 

VpeR^, E{p)=''^pi{\og{pi)-l) + L^+{pi), (1.3) 

i=l 

Q 

Vtt e i;(7r) =' - 1) + '-M+(7i'i,i), (1.4) 

hi=i 


with the convention that Olog(O) = 0. 

We define the Kulback-Leibler divergence on both vectors and couplings as 


Q 


V {p, q) eR+ X KL{p\q) =' '^pi log {^-^j - Pi + qi, 


i=l 

Q 


V (tt, e e Rh"^ X KL(7r|e) = ^ tt,,,- log ( ^ ) - 

hi=i 




(1.5) 

( 1 . 6 ) 


^https://github.com/gpeyre/2015-SIIMS-wasserstein-jko/. 
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2. Entropic Discrete JKO Flows. In this article, we consider discrete flows 
(i.e. evolutions are discretized in time) of discrete probability distributions (i.e. the 
space is also discretized, and we deal with finite dimensional problems). More pre¬ 
cisely, we consider a computational grid {xi}f^^ of N points, which can be understood 
for instance as an uniform grid in a sub-set of (in the numerical illustrations of 
Section 4 we consider d = 2 ) or as vertices of a triangulation of a surface. We thus 
consider discrete probability measures on this set of points, which can be understood 
as sums of Dirac masses located at the x^’s locations, and which we represent in the 
following as vectors p G Sat in the simplex as defined in ( 1 . 1 ). 

2.1. Entropic Regularization of Wasserstein Distance. Discretized opti¬ 

mal transport of such discrete measure is defined according to some ground cost 
c G A typical scenario is when xi G are points in the Euclidean space 

and one considers Ci^j = \xi — Xj\^^ corresponding to the definition of Wasserstein 
distances. The case a = 1 corresponds to Monge’s original problem, and o = 2 to the 
so-called W 2 metric, which is by far the most studied case because of its geometrical 
properties [64] . A natural extension is to consider points Xi on a smooth manifold M 
and to define = dMi^i^Xj)"^ where dM is the geodesic distance on the manifold. 

Following several recent works (see Sections 1 . 2 ), the entropic-regularized trans¬ 
portation distance between (p, q) G for a ground cost c G is 

W^{p, q) min (c, tt) -h jE{7r) 

7rell{p,q) 

for some regularization parameter 7 ^ 0 , where the set of couplings with prescribed 
marginals (p, q) is 

n(P 5 q) {tt G ; ttI = p, TT^l = q} . 

The case 7 = 0 corresponds to the classical, un-regularized, optimal transport, and 
is a linear program. The case 7 > 0 corresponds to a strictly convex minimization 
problem, where E plays the role of a barrier function of the positive octant making the 
optimization problem better conditioned numerically. But there is more than merely 
a strict-convexiflcation of the original functional, otherwise one could have settle for 
more classical log-barrier routinely used in interior-point methods [50]. Algebraic 
properties of the entropy, and its close relationship with the Kulback-Leibler (KL) 
divergence (1.5) (see Section 3.2 for a precise statement) indeed enables closed-form 
solutions for the various marginal projections problems encountered in OT problems 
(see for instance Proposition 3.1). 

It is important to remind that is not a distance for 7 > 0, and we refer to 
Section 5.3 for a discussion on the impact of this deficiency. 

2.2. JKO Stepping. Following the initial work of [43] (which gives the name of 
the method, “JKO flows”), it is possible to discretize various non-linear PDE’s as a 
gradient flow of a functional / using implicit gradient step with respect to a Wasser¬ 
stein distance. Our method relies on the idea of replacing the initial Wasserstein 
metric by its entropic regularized approximation. 

A entropically regularized JKO iteration is an implicit descent descent step with 
respect to the “metric”. To be consistent with notations introduced in the re¬ 
maining parts of this article, we thus refer to it as a proximal operator according to 
Wy, and its definition reads, for r > 0 , 

Vg' G Prox^'^ (g) argmin W^{p, q) -1- rf{p). (2.1) 

pe^N 
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Note that since p W^{p, q) is a strictly convex and coercive function, this operator 
is uniquely defined. 

Starting from some fixed discrete density pt=o ^ one defines the discrete JKO 
follow as 


Vi > 0, pt+i =' Prox^]* (pt), (2.2) 

where rt > 0 is the step size, 7 t > 0 the entropic regularization parameter. Note that 
we allow here these parameters to vary during the iterations, although we use fixed 
parameters in the numerical sections 4 and 5. 

When Ci^j = dM{xi^Xj)‘^ (the geodesic distance squared on some smooth manifold 
Af), / is smooth and 7 = 0 (no entropic regularization), a formal computation shows 
that this scheme discretizes, as (r, 1/N) 0, the PDE 

^ = divyn (pVMif'ip))), 

where div^ and m the gradient and divergence operators on the manifold A4, 
and /' is the differential of / (the gradient for the metric on Af). We refer to [38] 
for a proof of this relationship when / is an entropy on a manifold. 

For instance, in the case where f{p) = E{p) + (p, re), this discrete flow thus 
discretizes a linear diffusion-advection on the manifold 

dp 

= Am{p) + diY m{pz) where z = Vm(uj). 

so that the mass get advected by the vector field 2 ). 

2.3. KL Proximal Operators. In order for the method that we propose to be 
applicable, the function / must be convex and should be “simple” in the sense that 
one should be able to compute easily its proximal operator for the KL divergence. 
Similarly to (2.1), this proximal operator is defined as 

Vq e PTOx^f(q) =^' argmin KL(p|g) + rf{p). (2.3) 


Note that since p KL(p|g) is a strictly convex and coercive function, this operator 
is uniquely defined. Section (4) shows two examples of such “simple” functions: an 
indicator of a box constraint (for crowd movement) and generalized entropies (for 
non-linear diffusions). 

The underlying rationale behind the framework we propose in this article is that, 
while it is in general impossible to directly compute the operator Prox^'^, there are 

many functionals for which Piox^j" is accessible either in closed form, or through a 
fast and precise algorithm. We will thus trade the application of a single implicit 
proximal step by the iterative application of several KL implicit proximal steps. Note 
that, in particular, / does not need to be smooth, which is crucial to model non-PDF 
evolutions such as crowd movements with congestion [49]. 

The main property of the KL proximal operator are recalled in Appendix A. 

3. A Bregman Proximal Splitting Approach. In this section, we show how 
to re-formulate a single entropic regularized JKO step in order to introduce a KL 
divergence penalty. This is useful to allows for the application of generalized first 
order proximal methods. 
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3.1. Reformulation as a KL Minimization. We consider a single time step 
t, and denoting q pt the previous iterate of the flow, one can re-write the JKO 
stepping operator (2.1) as 


Prox^'" (g) = ttI 

where tt G solves the following strictly convex optimization problem 

min (c, tt) -h jE{7r) + r/(7rl) + lc^ (tt) (3.1) 

TT 

where we introduced the constraint set 

Cq =■ {tt e = g} . 

The initial formulation (3.1) can be re-cast as 

min KL(7r|C) ipi{Tr) ip 2 {Tr) where | (3.2) 

TT y (f2{TT) = -/(ttI), 

where we defined the Gibbs kernel ^ as 

^ Ifl- g-c/7 g 

3.2. Dykstra Algorithm with Bregman Divergences. 

Bregman divergence and proximal map. In order to give a more general treatment 
of optimization problems of the form (3.2), that can be useful beyond the particular 
context of this article, we consider a generic Bregman divergence Dr, defined on some 
convex set V. 

We follow [9] and define a Bregman divergence (see for instance) as 

V(7r,C) e P X int(P), Dr(7r|C) = r(7r) - r(C) - (Vr(0, tt - C). 

where P is a strictly convex function, smooth on int(P) where V = dom(r) such that 
its Legendre transform 


T*{p) = max {tt, p)-T{tt), 

7t£T> 

is also smooth and strictly convex. In particular, one has that VP and VP* are 
bijective maps between int(P) and int(dom(r*)) such that VP* = VP”^. 

For r = II • P, one recovers the Euclidean norm Dp = || * |P- One has KL = Dp for 
r(7r) = E(7r), which is cased we used to tackle (3.2). Note that in general. Dp is not 
symmetric and does not satisfy the triangular inequality, so that it is not a distance. 
We refer to [9] for a table detailing many examples of Bregman’s divergences. 

Let us write the general form of problem (3.2) as 

min Dp(7r|^) + Pi{7r) + P 2 M (3.3) 

7T£T> 

where are two proper, lower-semicontinuous convex functions defined on V. 

We also assume that the following qualification constraint holds 

ri(dom((/:?i)) D ri(dom((/:?2)) H ri(dom(r)) 7^ 0. 
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(3.4) 


where ri is the relative interior and doni((/9) = {tt ; ^ +00}. 

We define the proximal map of a convex function according to this divergence 
as 


Prox^^(7r) argmin Dr(7f|7r) + (3.5) 

We assume that cp is coercive, so that Prox^^ ( tt ) is always uniquely defined by strict 
convexity. Furthermore, one has Prox^^(7r) G int(P), see [9]. 

Dykstra’s iterations. Dykstra’s algorithm starts by initializing 

and ^(0) = ^(-1) 0. 

One then iteratively defines, for ^ > 0 

ttW Ig- Prox°';j^(vr*(vr(7r(^-i)) + (3.6) 

<i|!' yie-2) vr(7r(^-i)) - vr{7r(^)), (3.7) 

where [£]2 is defined in (1.2). Note that the iterates satisfies G int(P), so that 
the algorithm is well defined. 

The iterates of this algorithm are known to converge to the solution of (3.3) 

in the case where (pi and (p 2 are indicators of convex sets, see [9]. This corresponds to 
the case where Prox^^^ for i = 1,2 are projectors according to the Bregman divergence. 

Convergence proof. This convergence result in fact caries over to the more general 
setting where {(pi,(p 2 ) are arbitrary proper and lower-semicontinuous convex func¬ 
tions. The proof follows from the fact that Dykstra’s iterations correspond to an 
alternate block minimization algorithm on the dual problem. This idea was suggested 
to us by Antonin Chambolle and Jalal Fadili. 

Proposition 1. If condition (3.4) holds, then converges to the solution 
of (3.3). 

Proof. The dual problem to (3.3) reads 

max - (pl{ui) - (P 2 {u 2 ) - r*((a - ui - U 2 ) - C{f) (3.8) 

Wi,U2 


where the constant is C{f) {a, — r(^) and where we defined a Vr(^). 

Duality means that under the domain qualification hypothesis (3.4), the minimum 
value of (3.3) and the maximum value of (3.8) are the same, and that the primal 
solution TT can be recovered from the dual one {ui,U 2 ) as 

TT = vr*(—— U2). (3.9) 

Starting from 1^2^^) = (0, 0), the alternate block optimization on (3.8) defines 

a sequence where, denoting i = [£]2 (as defined in (1.2)) and j = 3 — i E 

{1,2}, the update at iteration £ reads 

and argmax — (Pi{ui) — P^{q — Ui) (3.10) 

Ui 

where we defined a — . 

Since in (3.8) the coupling term r*((a — ui — U 2 ) between {ui,U 2 ) is smooth, a 
classical result ensures that ,U 2 ^) converges to the solution (u\,U 2 ) of (3.8), see 
for instance [28]. 
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The primal problem associated to the dual maximization (3.10) is 


min r(7ri) - {q, tt*) + ipiiiTi) = Dr(7ri|Vr*(g)) + (piini) + C (3.11) 

7Ti 


where C G M is a constant. The primal-dual relationship between the solutions 
of (3.10) and (3.11) reads 

TTi = (3.12) 


Equations (3.10) and (3.12) show that one has 

=a- - vr O Prox^f (vr*(a - m?“^^)) • (3.13) 


We now perform the following change of variables 

(,)_/ if M2 = 1, 

1 -nf) if [eh = 2, 


and = —ui9 . 

Vj2 


One then verifies that these variables satisfy the relationship (3.7) and that (3.13) is 
equivalent to (3.6). This shows by recursion that corresponds to Dykstra’s 

variables. The convergence of toward {ui^U 2 ) implies that converges 

to TT* Vr*(a — Ui — u\) which is the solution of (3.3) thanks to the primal-dual 
relationship (3.9). □ 

3.3. Dykstra’s Algorithm for KL divergence. We now consider the case 
where T = E, Dr = KL. To ease the notations, we make the change of variables 
vr(i;^^^). One has that VT = log and VT* = exp and thus one has the 

iterates 


=■ ^ and =' 1 

(3.14) 

ye > 0, =■ 

(3.15) 

ttW 

(3.16) 


Recall here that 0 and i denotes entry-wise operations. 

3.4. KL Proximal Operator for JKO Stepping. In order to be able to apply 
iterations (3.15) and (3.16), one needs to be able to compute the proximal operator 
for the KL divergence of epi and (^2- 

The following proposition shows that these proximal operators for the KL diver¬ 
gence can be indeed computed in closed form as long as one can compute in closed 
for the proximal operator of / for the KL divergence. 

Proposition 3.1. For any n , one has 

/ a \ /Prox?^(7rl) \ 

Prox^[^(7r) = Trdiag and Prox^^^(7r) = diag I - ^ - 1 tt (3.17) 


Proof. The computation of Prox^^ is obtained by combining (A.7) and (A.2) in 
the special case M = 1. The computation of Prox^^ is obtained by applying (A.6) in 
the special case of M = 1 coupling. □ 
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3.5. Dykstra Algorithm for JKO Stepping. Writing down the first order 
optimality conditions with respect to tt for problem (3.1) shows that there exists 
(a, 6) G such that the optimal tt satisfies tt = diag(a)^ diag(6). It means that, 

just as for the classical entropic regularization of optimal transport [31], the optimal 
coupling TT is a diagonal scaling of the initial Gibbs kernel This remark actually 
not only holds for the optimal tt, but it also holds for each iterate constructed by 
iterations (3.15) and (3.16) that defines G (M;!^)^. 

The following proposition makes use of this remark and shows how to actually 
implement numerically iterations (3.15) and (3.16) of the method in a fast and parallel 
way using only matrix-vector multiplications against the kernel 

Proposition 3.2. The iterates of Dykstra’s algorithm ean be written as 

ttW = diag(a(^))^diag(6(^)) and (3.18) 

(i.e. is a rank-1 matrix) where G with the initializa- 

tion 

a(0) = ^(0) ^ y(0) ^ ^(0) ^ (-gjg) 

For odd £, the update of ) reads 

(jW ^ ©ij(^-2) j^{e) ^ y (3.20) 

while for even £ it reads 

(i) 

feW = 6(^-1) © and (3.21) 

’) 

where we defined 

p(e) Proxp(a(^“i) 0 © ^(foW)), (3,22) 

The update of is always 


uW = «(^- 2 ) © 


aW 


and 




6 (^- 1 ) 

b(^) 


(3.23) 


Proof One verifies that the format (3.18) holds for the initialization (3.19) and 
that it is maintained by the update formulas (3.17). Formulas (3.20), (3.21) and (3.23) 
are obtained by identifying the different terms when plugging the format (3.18) into 
the update formulas (3.17). □ 

The Pseudo-code 1 recaps all the successive steps needed to compute the full JKO 
flow (2.2) with entropic smoothing. This resolution thus only requires to iteratively 
apply, until a suitable convergence criterion is met, the update rules (3.20), (3.21) 
and (3.23). In practice, we found that monitoring the violation of the constraint Cq to 
be both a simple and efficient way to enforce a stopping criterion This criterion allows 
furthermore to precisely enforce mass conservation, i.e. pt ^ stays normalized to 
unit mass, which is important in many practical cases. 

The crux of the method, that is extensively used in the numerical section (see in 
particular Section 4.1) is that one only needs to know how to apply the kernel f and 
its adjoint (which are in most practical situations equals), which can be achieved 
either exactly or approximately in fast and highly parallelizable manner. 
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1 . Initialize t = 0 and pt=o- 

2 . Initialize £ = 1 and set 

a(0) = 6(0) = «(0) = ^(0) = 1 , 

3. Setting q pt^ update using (3.20) is £ is odd, and using (3.21) if 

£ is even. 

4. Update using (3.23). 

5. If 0e^(a(^))-g|| > 5 or if ^ is odd, set ^ ^ ^ + 1 and go back to step 3. 

6 . Set pt+i = P^^^ as defined by (3.22), t ^ t + 1 and go back to step 2. 
Pseudo-code 1: Iterations computing the full JKO flow. The inputs are the initial 
density pt=o, the parameters (7 ,t) and the tolerance 5. The outputs are the iterates 
{Pt)t>0' 

4. Numerical Results. We now illustrate the usefulness and versatility of our 
approach to compute approximate solutions to various non-linear diffusion processes. 
The videos showing the time evolutions displayed in the figures bellow are available 
online^. 


4.1. Exact and Approximate Kernel Computation. As already highlighted 
in Section 3.5, our method is efficient if one can compute in a fast way the multiplica¬ 
tion ^p between the Gibbs kernel ^ and a vector p G In the general case, 

this is intractable because this is a full matrix-vector multiplication. Even if ^ usually 
has an exponential decay away from the diagonal, precisely capturing this decay is 
important to achieved transportation of mass effects. In particular, truncating the 
kernel to obtain a sparse matrix is prohibited. 

Translation invariant metrics. The simplest setting is when the sampling points 
(as defined in Section 2) correspond to an uniform grid discretization of a 
translation invariant distance, i.e. Ci^j = D{xi — Xj)^ for some function ^ M. 

In this case, ^ is simply a discrete convolution against the kernel T)(-)^ sampled on an 
uniform grid. For instance, when D{-) = l-l and a = 2, ^ is simply a convolution with a 
Gaussian kernel of width 7. When using periodic or Neumann boundary conditions, it 
is thus possible to implement this convolution in 0{N\og{N)) operations using Fast 
Fourier Transforms (FFT’s). There also exists a flurry of linear time approximate 
convolutions, the most popular one being Deriche’s factorization with HR filters [35]. 
We used this method to generate Figures 4.1 and 4.6. The other figures require a 
more advanced treatment because the kernel is not translation invariant. We now 
detail this approach. 

Riemannian metrics. Unfortunately, many case of practical interest correspond 
to diffusions inside non-convex domains, or even on non-Euclidean domains, typically 
a Riemannian manifold A4 (possibly with a boundary). In this setting, the natural 
choice for the ground cost c is to set Cij = xj)^, where (Im is the geodesic 

distance on the manifold. A major issue is that computing this matrix c is costly, 
for instance it would take 0{N‘^ log(A)) using Fast-Marching technics [57] on a grid 
or a triangulated mesh of N points. Even storing this non-sparse matrix can be 
prohibitive, and applying it at each step of the Dykstra algorithm would require 
0{N^) operations. Inspired by the “geodesic in heat” method of [29], it has recently 
been proposed by [62] to speed up these computations by approximating the kernel ^ 
by a Laplace or a heat kernel ^ (depending on wether a = 1 or a = 2). This means 


^https://github.com/gpeyre/2015-SIIMS-wasserstein-jko/tree/master/videos 


11 




that c does not need to be pre-computed and stored, and that, as explained bellow, 
one can apply it on the fly at each iteration using a fast sparse linear solver. In the 
numerical tests, we have used this approximation. 

To this end, one only needs to have at its disposal a discrete approximation Am 
of the Laplacian operator on the manifold A4. This is easily achieved using axis- 
aligned finite differences for manifold discretized on a rectangular grid, and this is 
the case for Figures 4.2. When considering a discretized manifold A4 which is a 
triangulated surface (as this is the case for Figure 4.3), one can use piecewise linear 
Pi finite elements, and the operator Am is then the popular Laplacian with cotangent 
weights, see [17]. 

Following [62], the kernel ^ is then approximated using L G N* steps of implicit 
Euler time discretization of the resolution of the heat equation on A4 until time 7, 

i.e. 

( 4 . 1 ) 

where (•)“^ means that one iterates L times matrix inversion. 

When 1/ = 1, and ignoring discretization errors, the result of Varadhan [63] shows 
that in the limit of small 7, ^ converges to the kernel ^ obtained when using a = 1 
(i.e. “VFi” optimal transport). As L increases, ^ converges to the heat kernel, which 
can be shown, also using [63] to be consistent with the case a = 2 (i.e. “IF2” optimal 
transport). In the numerical tests, we have used L = 10. 

Numerically, the computation of matrix/vector multiplications appearing the 
Dykstra updates thus requires the resolution of L linear systems. Since these ma¬ 
trix/vector multiplications are computed many times during the iterations, a consid¬ 
erable speed-up is achieved by first pre-computing a sparse Cholesky factorization of 
Id — ^Am and then solving the L linear systems by back-substitution [33]. Although 
there is no theoretical guarantees, we observed numerically that each step typically 
has a linear time complexity because the factorization is indeed highly sparse. We 
refer to [29] for an experimental analysis of this class of Laplacian solvers. 

4.2. Crowd Motion Model. To model crowd motion, we follow [49], where 
a congestion effect (not too many peoples can be at the same position) is enforced 
by imposing that the density p of peoples follows a JKO flow with the functional / 
defined as 


VpGM^, /(p) ='+ (w, p) (4.2) 

where k ^ ||pt=o||oo is the congestion parameter (the smaller, the more congestion) 
and w G is a potential (the mass should move along the gradient of w). 

For such a function, the KL proximal operator is easy to compute, as detailed in 
the following proposition. 

Proposition 2. One has 

Vp G Prox^j (p) = min(p © n) (4.3) 

where the min should he understood eomponents-wise. 

Proof. The formula when w = 0 is easy to show, and one then apply (A.3) in the 
case M = 1. □ 

Note that it is of course possible to consider a k that is spatially varying to model 
a non-homogeneous congestion effect. 
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t = 0 t = 10 t = 20 t = 30 t = 40 


Fig. 4.1. Display of the influence of the congestion parameter n on the evolution (the driving 
potential w is displayed on the upper-right corner of Figure 4-^)- From top to bottom, the parameters 
are set to n/\\pt=o\\oo ^ {1,2,4}}. 

Figure 4.1 shows the influence of the congestion parameter k. This figure is 
obtained for the quadratic Euclidean cost Cij = \\xi — Xj\\‘^ on a N = 200 x 200 
uniform grid with Neumann boundary conditions. The kernel ^ is computed using a 
fast Gaussian filtering on this grid as detailed in Section 4.1. 

Figure 4.2 shows various evolutions for different potentials w (guiding the crowd 
to the exit) on a manifold Ai which is a sub-set of a square in M?. This means that 
locally the Riemannian metric is Euclidean, but since the domain is non-convex, the 
transport is defined according to a geodesic distance (Im which is not the Euclidean 
distance. The discretization is achieved using the approximate heat kernel (4.1) with 
L = 10 and on a grid of = 100 x 100 points. 

Lastly Figure 4.3 shows the evolution on a triangulated mesh of 20000 vertices, 
which is also implemented using the same heat kernel, but this time on a 3-D triangu¬ 
lation using piecewise linear finite elements (hence a discrete Laplacian with cotangent 
weights [17]). 

4.3. Anisotropic Diffusion Kernels. We consider the crowd motion func¬ 
tional (4.2) over measures defined on At = now equipped with a Riemannian 
manifold structure defined by some tensor field T{x) G of symmetric positive 

matrices. We use the heat kernel approximation detailed in Section 4.1. The ker¬ 
nel (4.1) thus corresponds to a discretization of an anisotropic diffusion, which are 
routinely used to perform image restoration [66]. As the anisotropy (i.e. the maxi¬ 
mum ratio between the maximum and minium eigenvalues) of the tensors increases, 
the corresponding linear PDF becomes ill-posed, and traditional discretizations using 
finite differences are inconsistent, leading to unacceptable artifacts. We thus use the 
adaptive anisotropic stencils recently proposed in [39] to define the sparse Laplacian 
matrix discretizing the manifold Laplacian Amu{x) = div{T{x)\/u{x)). This discrete 
Laplacian is able to cope with highly anisotropic tensor fields. This is illustrated 
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cos{w) t = 0 t = 15 t = 30 t = 45 


Fig. 4.3. Display of the evolution pt on a triangulated surfaee. From top to bottom, the 
eongestion parameter is set to ^v/||pt=o||oo ^ {I 56 }}. 
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in Figure 4.4, which shows the impact of the anisotropy on the trajectory of the 
mass. The potential w creates an horizontal movement of the mass, but the circular 
shape of the tensor orientations forces the mass to rather follow a curved trajectory. 
Ultimately, mass accumulates on the left side and the congestion effect appears. 



t = 0 t = lb t = 30 t = 45 t = 60 cos(re), T 

Fig. 4.4. Display of the evolution pt using anisotropie diffusion kernels. The right column 
displays in the background the level-sets of w and the tensor field T{x) displayed as red ellipsoids. 
An ellipsoid at point x is oriented along the principal axis of Tix), and the length/width ratio is 
proportional to the anisotropy ofT{x). The metric thus favors mass movements along the direction 
of the ellipsoids. The anisotropy (ratio between maximum and minimum eigenvalues ofT{x)) is set 
respectively from top to bottom to {2,10, 30} in each of the successive rows. 


4.4. Non-linear Diffusions. To model non-linear diffusion equations, we con¬ 
sider (possibly space-varying) generalized entropies 


/(P) ='^hemAPi) 

i 


where 


V TYl ^ 1, Cyyi 



s(log(s) - 1) 
if 

m—1 


if m = 1, 
m > 1. 


(4.4) 

Here {bi)fLi is a set of weights hi ^ 0 that enable a specially varying diffusion strength, 
while is a set of exponents that enable to make the evolution more non-linear 

at certain locations. Note that the case m = 1 corresponds to minus the entropy 
defined in (1.3). 

In the case of constant weights b and exponents m, the gradient flows of these 
functionals lead to non-linear diffusions of the form dtP = Ap^. The case m = 1 
is the usual linear heat diffusion, as considered in the initial work of [43]. The case 
m = 2 is the so-called porous medium equation [51], where diffusion is slower in areas 
where the density p is small. In particular, solutions might have a compact support 
that evolves in time, on contrary to the linear heat diffusion where mass can travel 
with infinite speed. 

The following proposition, whose proof follows from writing the first order condi¬ 
tion of (3.5), details how to compute the proximal operator of h. 

Proposition 3. The proximal operator of f satisfies 


ProxKpr) = (ProxKL^^^(r,))(I,. 
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For m = 1, the proximal operator of ei reads 


V s > 0, Prox^^^ (s) = . (4.5) 

If m ^ 1, then for any 5 > 0; Prox^^^(5) = f; is the unique positive root of the 
equation 

^m-l _ I 

log(fj) + ma -= log(5) (4.6) 

m — 1 


In the numerical applications, we compute Prox^J"^ by using a few steps of New¬ 
ton iterations to solve (4.6), which can be parallelized over all the grid’s locations. 
Figure 4.5 shows examples of the energy and the corresponding proximal maps 
Prox^^^. They act as pointwise non-linear thresholding operators that are applied 
iteratively on the probability distribution being computed. In some sense, the con¬ 
gestion term (4.2) and the corresponding proximal operator (4.3) can be understood 
as a limit of this model as m ^ -hoc. 



Figure (4.6) shows illustration of the models in the case where either 6 or m is 
varying, thus producing a spatially varying flow. The initial distribution pt=o is com¬ 
puted as a white noise realization, where the pixels are independently and identically 
drawn according to a uniform distribution on [0,1] (and then p is normalized to unit 
mass). 

4.5. Non-convex Functionals. It is formally possible to apply Dykstra’s algo¬ 
rithm detailed in Section 3.4 to a non-convex function /, if one is able to compute in 
closed form the proximal operator (3.5) (which then might be a multi-valued map). 
Of course there is no hope for the resulting non-convex Dykstra’s algorithm to con¬ 
verge in general to the global minimizer of the non-convex optimization (3.2). Even 
worse, to the best of our knowledge, there is currently no proof that the non-convex 
Dykstra’s algorithm converges to a stationary point of the energy, even in the case 
of an Euclidean divergence. However, we found that applying Dykstra’s algorithm 
to non-convex functions works remarkably well in practice. Note that the closely 
related Douglas-Rachford (DR) algorithm is known to converge in some particular 
non-convex cases [5]. DR is known to perform very well on several non-convex opti¬ 
mization problems such as phase retrieval [8]. 

To test this non-convex setting, we replace the congestion box constraint (4.2) by 
the non-convex function 

f{p) =' p)- 
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(4.7) 


























Fig. 4.6. Non-linear diffusion by gradient flow on the generalized entropies (4.4). Top row: 
fixed rui = 1.4 and varying weights bi G [1,20] (1 in the boundary, 20 in the eenter). Bottom row: 
fixed bi = 1 and varying exponents mi G [1,2] (1 in the boundary, 2 in the center). 

This function enforces that the thought after solution is binary, so that each value 
Pi is in {0, k}. The proximal operator of this non-convex function can be computed 
explicitly using 

{ 0 if Pi < k/c, 

{0,/t} if Pi = /t/e, 

K, if pi> nje^ 

where e = exp(l). Note that Prox^^^^ {p)i is multi-valued at pi = /^/e, and numer¬ 
ically one needs to chose one of the two values. Figure 4.7 shows a comparison of the 
evolutions obtained with the convex and non-convex functionals. The non-convex one 
suffers from binary noise artefacts, which could be partly due to the non-convexity, 
but also to the amplification of discretization errors by the proximal mapping which 
is not Lipschitz continuous. 



Fig. 4.7. Display of crowd evolution for k = ||pt=o||oo- Top row: convex function (4.2). Bottom 
row: non-convex function (4.7). 

5. More General Functionals. In order to highlight the power of the proposed 
entropic regularization approach, we show here how to adapt the algorithm detailed 
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in Section 3 in order to deal with more involved functionals. These functionals require 
the introduction of several couplings, which in turn necessitates to develop a generic 
iterative scaling procedure derived from Dykstra’s algorithm. This new method has 
its own interest, beyond the computation of Wasserstein gradient flows. 

5.1. A Generic Diagonal Scaling Algorithm. In order to tackle a more 
general class of functions /, we consider here a generalization of problem (3.2) where 
one wants to optimize over a family tt = (tti, ..., ttm) of M couplings tt^ G a 

functional of the form 


min KL\(7r|0 + + ^ 2 M where 


(^2(7r) =' ip2in'^l), 


(5.1) 


where, for A G , KLa is the weighted KL divergence (see also (A.l)) 


M 

V(7r,0 e X (R^x^)^, KLA(7r|0 = ^ A„KL(7r,„|U) 

m=l 

and where we denoted, with a slight abuse of notations, the collection of left and right 
marginals as 


ttI = (ttiI, ..., ttmI) ^ and tt^I = (tt^ 1,..., tt^I) G (M^)^ 

and : (M^)^ ^ M are convex functions for which one can compute the proximal 
operator Prox^^^^ according to the KLa divergence. 

We wish to apply Dykstra’s iterations (3.15) and (3.16) to (5.1). This requires to 
compute the proximal operator of the functions ipi. The following proposition details 
how to achieve this using the proximal operator of the functions alone. 

Proposition 5.1. We denote, fori = 1,2, Prox^^^^(7r). We denote, for 

m = 1,..., M, pm "= and pm "= One has 

Vm e {1,..., M}, TTW = diag j and 7r|^l = diag j 
where Mi e {1,2}, (pW)m = Prox^p ((p^*')m) • 


Proof This corresponds to an application of formulas (A.6) and (A.7). □ 

The following proposition, which is similar to Proposition 3.2, explains how to 
implement the iterations of Dykstra’s algorithm using only multiplications with the 
kernels (Cm)m- 

Proposition 5.2. The iterates ... ,7r^^) of Dykstra’s algorithm ean 

he written as 

Vm e (1,... ,M}, = diag(a^))^mdiag(6^)) and . 

with the initialization 

Vm e {1,... ,M}, = vj°^ =' 1. 
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We define, Vm G {1,..., M}, 


{ -(^-1) „ ^ f7{l-l)\ 

dm (:)^m[Om ) 

rii-1) ^ tT (~{^-^)\ 
Om ^ ^mWm ) 


^f M2 = 1, 

if [i]2=2. 


The update reads, Vm G {1,..., M}, 


je) drf- j Pm® [Cm{bm ^ */ ^2 = 1 

1 tf ^2=2 

M ‘Isf' f */ M2 = 1 

“ ” 1 PmQlCidT^^)]-^ if M 2 =2 


u 


(^) 

m 


= © 


,(«-!) 

{ 1 ) 

a)n 


and 


M 


=• 7/^-2) 


0 


b 


(£- 1 ) 

m 


b 


(i) ■ 

m 


5.2. Wasserstein Attraction with Congestion. We now give a first concrete 
example of functional / for which the formulation (5.1) should be used in place of (3.2). 

Instead of advecting the mass of pt according to a fixed potential w as it is 
considered in the functional (4.2), it is possible to make it evolve toward a “target” 
distribution r G Hat by minimizing the Wasserstein distance between pt and r. It 
thus consists in considering the gradient flow of the function 

'ipeT,N, f{p) = Wj{r,p) + h{p) + i^^{p), (5.2) 

where h{p) is a function for which one can compute its KL proximal operator as 
defined in (2.3). 

We now denote q pt the previous iterate, and aim at solving a single JKO 
step (3.1). It is not possible to compute in closed form the KL proximal operator 
of the function / defined in (5.2), so that the algorithm detailed in Section 3 is not 
directly applicable. 

Instead, we re-formulate (3.1) as a KL minimization of the form (5.1) involving 
M = 2 couplings tt = ( 7ri,7r2) G (M^^^)^ and kernels f = (C15C2) "= (e~^l^ 

This encodes implicitly the solution p = ttiI = tt 2 '^ of (3.1) using the solution tt = 
(7ri,7r2) of (5.1) when introducing the functions 

'0i(pi,P2) = iv{pi,P 2 ) + Hpi) where V = {(^1,^2) ; Pi = P 2 } , 

'ip2{Pl^P2) = l^{q,r}{Pl^P2)^ 

and the weights A = (l,r) G The following proposition details how to compute 
the proximal operator of these functionals. It is important to remind that that these 
functionals as well as their respective proximal operators operate on vectors of 
not on couplings. 

Proposition 5.3. One has 

Prox^^^ (pi,P2) = {p^p) where p = Prox^^L 
= {q,r) 


pr^ Qpr 
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(5.3) 

(5.4) 




Proof. The computation of Prox^^^ follows from (A.4). The computation of 
Prox^^^ follows from (A.2). □ 

With these proximal maps at hands, and with the formula for the iterations 
detailed in Proposition 5.2, one can solve for each JKO step by computing the optimal 
(7ri,7r2) using q = pt and then updating pt+i = ttiI = 7r2l. 

Numerical Illustrations. In order to introduce some congestion, we consider here 
the function h{p) = as in (4.2), and its KL proximal operator is computed as 

detailed in (4.3). 

Figure 5.1 shows some examples of such a JKO flows computed on a rectangular 
grid of = 100 X100 points. The right hand side column shows the target distribution 
r. Note that the flow pt typically does not converge toward r as t ^ Too, because of 
the congestion effect. 



t = 0 t = 10 t = 20 t = 30 t = 40 r 


Fig. 5.1. Examples of gradient flows for the Wasserstein attraetion toward the density r dis¬ 
played on the right eolumn. The congestion parameter is set to k, = ||pt=o||oo- 


5.3. Multiple Densities Evolutions. A natural extension of the JKO flow (2.2) 
is to describe the evolution of a finite family of densities pt = {Pi,t)i by minimizing a 
function /((pi)i), where one defines the transport distance as the sum of independent 
Wasserstein distances 




The function / thus introduce a coupling between densities during the evolution. For 
simplicity we consider in the following the case of 2 densities. 

Wasserstein pairwise attraction. We first consider the case where the coupling is 
a Wasserstein attraction between the two densities 

f{Pl^P2) = C^W{pi,P2) + hi{pi) + hi{p2) 
where the functions hi are “simple” so that one can compute easily Prox)^^. 
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Denoting q = (^ 1 ,^ 2 ) the previous iterate at time t, the solution 

p = pt+i to the JKO step (3.1) can be written as 


P = {P1,P2) = (7ril,7r2l) = (ttJ l,7rj’ 1), 


where one needs to solve for M = 3 couplings (tti, 7r2, tts) the problem (5.1) with the 
functionals 


'0i(ai,a2,a3) = ivicti.as) + + -hi(ai), 

7 " 

' 02 ( 0 ^ 1 , <^ 2 , 0 ^ 3 ) = iv{cL2,a3) + i{q^}{ai) + -h2{a2) 

with KL weights A = (1, l,ra). The proximal operators of these functions are easy 
to compute as detailed in the following proposition. 

Proposition 4. For i e {1,2}, denoting j = 3 - i G {1,2}, one has 

{bm)m = {am)m where bi = bs = Prox/,, (a), bj = qj. 


Proof. The expression for (bi^bs) is obtained by using (A.4). The expression for 
bj is obtained by using (A.2). □ 

In the numerical example, we used hi{pi) i[o^K]^{pi) + ('R’i, Pi) for potentials 
{wi,W 2 ) G X M^. The KL proximal operator of these functions can be computed 
as detailed in Proposition 2. Figure 5.2 displays the results obtained on a rectangular 
grid of = 200 x 200 points. 



Fig. 5.2. Evolution with a pairwise attraction between two densities, with congestion parameter 
= ||pi,t=o||oo = ||P 2 ,t=o||oo• Display of both pi^t (red) and p2^t (green), yellow indicates a mixing. 
Top row: a = 1. Bottom row: a = 3. 

Summation eouplings. Another way to introduce some interaction between pi and 
P 2 is to consider a coupling on the sum 

f{Pi^P2) =■ h{pi +P 2 ) + (pi, wi) + (p2, W 2 ). 
for some function h for which one can compute easily Prox)^^. 
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In this case, the solution p = pt+i to the JKO step (3.1) can be written as 
P = (P15P2) = (7ril,7r2l) where the M = 2 couplings (7ri,7r2) solves problem (5.1) 
with the functionals 

7 " 

'0i(ai,a2) = -/(ai,a2), 

7 

'02(<^l,<^2) = <^ 2 ) 

and weights A = (1,1). The proximal operators of the functions are easy to compute 
as detailed in the following proposition. 

Proposition 5. One has 


Proxl^y 


(01,02) = 


Proxzit(ai + 02) 


tti + a2 

(01,02) = (51,92)- 


© (01,02) 


where hi = ai Q e ^ 

Proof. The expression for Prox^^^ is obtained by combining (A.3) and (A.6). 
The expression for Prox^^^ is obtained by using (A.2). □ 



Fig. 5.3. Evolution with a summation coupling E(pi +P 2 )- Top row: display of both pi (red) 
and p 2 (green), yellow indieates a mixing. Middle row: display of pi -\-p2, whieh evolves aeeording 
to a linear heat diffusion. Bottom row: display of pi. 

As a first example, we consider an entropic coupling h = with wi = W 2 = 0. 
Its proximal operator is computed in (4.5). A formal computation shows that, for the 
Euclidean IT2 transport on the corresponding discrete JKO steps (2.2) is intended 
at approximating the non-linear PDE over pt = (pi,t,P2,t) 

Vie {1,2}, Vi> 0 , dtPi,t = div(—^ - SIpn 

\Pl,t +P2,t 


22 










This shows that while pt follows a non-linear coupled diffusion, pi^t + P 2 ,t follows a 
linear heat diffusion. Figure 5.3 shows a numerical illustration on a regular grid of 
N = 200 X 200 points. 

As a second example, we consider a congestion coupling h = Its proximal 

operator is computed in (4.3). Figure 5.4 shows a numerical illustration on a regular 
grid of = 200 x 200 points. It shows two densities, initially supported on non¬ 
overlapping squares, moving in opposite directions under potentials (rci, 1 ^ 2 ) such that 
Vrci = (1,0)^ and \/w 2 = (—1,0)^ (constant horizontal gradients). A congestion 
shock is created by the overlap of the densities, which in turn forces the support of 
the densities to be deformed and vertically enlarged. 



t = 0 t = 10 t = 20 t = 30 t = 40 


Fig. 5.4. Evolution with a summation coupling + P 2 ) where k = q||oo = 

||P 2 ,t=o||oo • Top row: display of both pi^t (red) and p2,t (green), yellow indieates a mixing. Middle 
row: display of pi^t +P 2 ,t- Bottom row: display ofpi^t- 


Discussion and Conclusion. In this paper, we have presented a novel al¬ 
gorithm to compute approximate discrete gradient flows according to an entropic 
smoothing of the Wasserstein distance. The main interest of the method is its speed, 
simplicity and versatility. This is achieved because the iterations only require (beside 
pointwise multiplications, divisions and exponentiations) to compute the successive 
applications of a “convolution-like” operator corresponding to the Gibbs kernel asso¬ 
ciated to the metric. 

A natural question is to explore whether the discrete flow defined by ( 2 . 2 ) has a 
continuous limit when n = r ^ 0. If one uses a fixed 7 t = 7 > 0, this is not the case, 
because does not satisfies W^{p^p) = 0. More precisely, one has that 

argmin q) = ^ ) ’ 

SO that the limit for small r of pt+i defined by (2.2) is a blurred (i.e. multiplied by 
^) version of pt. Instead of using a fixed value for 7 ^, choosing 7 ^ = 7 (Tt) for some 
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carefully chosen function of Tt could allow the discrete flow to converge to the usual 
Wasserstein flow. We leave the analysis of this asymptotic setting to a future work. 
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Appendix A. KL Proximal Calculus. 

The following proposition details some useful property of the KL proximal op¬ 
erator (2.3). This enables a powerful “proximal calculus” by combining these rules, 
which eases and simplifies the implementation of the algorithms. Note that we also 
consider generalized KL divergence over sets (pi,... ,Pm) of M densities according to 
some weight A G 

M 

^ {Prn)m') {Slm)m'i ((^772)771 1 (^m)m) — E 

m=l 


Proposition 

A.l. For fipi,...,pM)’^= 

'•{( 91 ,- 


. ,Pm), one 

has 



ProxJ^^^(pi,...,PM) = 

= {Qi,- 

• • ,^m)- 


(A.2) 

For f{pi,. 

• • • ,Pm) 

=' h{pi,... ,pm) + 

, Pi), 

one has 



Proxj^^(pi, 

= Prox)^^^ (pi © e"’ 

vi/Xi 

•) 

• • • ,Pm 0 

-WM IXm ^ 

(A.3) 

For f{pi,. 

• • • ,Pm) 

=' ■ ■ ■ ,Pm) + h{pi, ... 

■ ,Pm) 

where 





'F>=' {(pi,...,pm) ; Pi 

= ... 

• = Pm} , 




one has 

Prox^^^(pi,... ,Pm) = (p, • • • ,p) where p = Prox^^i Q ■ ■ ■ Q (A.4) 

where we denoted Aj '*= Xi/^j \j and h{p) = h(p,... ,p). 

For fipi,... ,pm) =' h{pi + ... +pm) and X =' (1,..., 1), one has 


Prox^^^(pi,...,PM) 


Prox)^^(ffi + ... + pm) 

Pi + ... + Pm 


{Pi,...,Pm) 


(A.5) 


We define f{7Ti^... ^7 Vm) ='h(7ril,..., ttmL)- We denote 


Vm e M}, =' TT^l and (pi,... ,pm) ='Proxy^(pi,... ,Pm)- 


One has 


Prox^^^(7ri ,. ..,71 m) 



(A.6) 
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m 



We define /(tti, ..., ttm) /i(7rf 1,..., denote 

Vm G {1,... ,M}, =■ TT^l and (pi,... ,pm) =' Prox^^^(pi,... ,pm). 

One /ia5 

Prox^^^(7ri,... ,7rM) = ^Trdiag (A.7) 

Proof. Proof of (A.2). This is straightforward. 

Proof of (A.3). If follows from the relation 

M 

KLA((gi,... ,gM)|(pi, • • • ,Pm)) + Qi) 

i=l 

=KLa((9i, .. .,qM)\ipi © . ..,pmQ 

Proof of (A.4). We denote {qm)m =' Prox^G((Pm)m)> so that q = = ... = 

solves 

min W AmKL(g|p„) + h{q). 

q 

m 

The result follow from the relation 

^A^KL(gK) = (E„A„)KL (q\pl^ Q...Qplr). 

m 

Proof of (A.5). Denoting {qm)m = PrOX^^^((Pm)m), the first order optimality 
condition for Prox^ reads 

Vm G {1,..., M}, log ( ^ ) J^u = 0 
\PmJ 

where u £ dh{pi . +Pm)- respectively summing and subtracting these equations 
lead to 

9i + • • •+9 m = Prox^(pi + ...+Pm) and — = 

Pi Pm 

Solving for (gi, • • •, ^m) in these equations leads to the desired solution. 

Proof of (A.6). The first order condition for n being a solution of (3.5) states the 
existence of {zm)m ^ dfiPi^ • • • ^Pm) where Pm = such that 

Amlog ( — ) + Zmt^ = 0 => Tfm = diag(e“^’”/^™)7rm ^ p^ = diag(e“^™/^’”)pm, 

\ '^rn J 

which corresponds to the first order condition for {pm)m being a solution of (2.3) for 
the function h, i.e. 

{Pm)m — P^OX^ {{Pm)m) ‘ 

Finally, one obtains 

TTm = diag(e“^™/^"*)7r„ = diag ( — ) 7r„ 
and hence the desired result. 

Proof of (A.7). It is obtained by transposing formula (A.6). □ 
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